The statistic tf-idf (term frequency - inverse document frequency) is intended to measure how important a word is to a document in a collection (or corpus) of documents.
The inverse document frequency for any given term is defined as
\[ idf\text{(term)}=\ln\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right) \]
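As a minimal illustration of the statistic (a toy two-document term matrix, not the study corpus), tf-idf can be computed in base R:

```r
# toy document-term matrix (counts); illustration only
tf <- matrix(c(2, 1, 0,
               1, 0, 3),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("doc1", "doc2"),
                             c("party", "agenda", "media")))

n_documents  <- nrow(tf)
n_containing <- colSums(tf > 0)          # documents containing each term
idf <- log(n_documents / n_containing)   # inverse document frequency

# term frequency (share of a term within its document) times idf
tf_idf <- sweep(tf / rowSums(tf), 2, idf, `*`)
round(tf_idf, 3)
```

Note that "party" appears in both documents, so its idf (and hence its tf-idf) is zero: a term that occurs everywhere carries no information about any single document.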
Parties want the media agenda to be congruent with their own agenda so as to define the issue-based criteria on which voters will evaluate them (Eberl, 2017). Thus, parties instrumentalize their press releases to highlight issues on which they are perceived to be competent, which they “own”, and which are important to their voters (Kepplinger & Maurer, 2004). Editors can select from this universe and decide which of these topics will be discussed in the news. In that sense, the ideological content of a newspaper refers to the extent to which the topics promoted by the parties correlate with the topics discussed in the news articles.
To discover the latent topics in the corpus of press releases (1,942) and news articles (11,880), a structural topic model (STM), developed by Roberts et al. (2016), is applied. The STM is an unsupervised machine learning approach that models topics as multinomial distributions over words and documents as multinomial distributions over topics, and it allows external covariates to be incorporated that affect both topical content and topical prevalence.
STM assumes a fixed user-specified number of topics. There is not a “right” answer to the number of topics that are appropriate for a given corpus (Grimmer and Stewart 2013). Roberts et al. (2016) propose to measure topic quality through a combination of semantic coherence and exclusivity of words to topics. Semantic coherence is a criterion developed by Mimno et al. (2011) and is closely related to pointwise mutual information (Newman et al. 2010): it is maximized when the most probable words in a given topic frequently co-occur together.
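Both diagnostics are available directly in the stm package. A hedged sketch on the package's bundled Gadarian survey data (not the press-release/news corpus used here):

```r
library(stm)

# toy fit on the bundled Gadarian open-ended responses; illustration only
processed <- textProcessor(gadarian$open.ended.response, metadata = gadarian)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)
fit <- stm(out$documents, out$vocab, K = 5, max.em.its = 15, verbose = FALSE)

semanticCoherence(fit, out$documents)  # one value per topic; higher = more coherent
exclusivity(fit)                       # one value per topic; higher = more exclusive
```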
Using the function searchK, several automated tests are performed to help choose the number of topics, including the average exclusivity and semantic coherence as well as the held-out likelihood (Wallach et al. 2009) and the residuals (Taddy 2012).
I included the document source as a covariate for topical prevalence, assuming that the distribution of topics depends on the source. The number of topics is set to 50.
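A sketch of this selection step, again on the toy Gadarian data (the actual analysis ran searchK over a grid of candidate K on the full corpus, with the source as prevalence covariate):

```r
library(stm)

# toy corpus; illustration only
processed <- textProcessor(gadarian$open.ended.response, metadata = gadarian)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)

# compare candidate numbers of topics on held-out likelihood, residuals,
# semantic coherence and exclusivity
kresult <- searchK(out$documents, out$vocab, K = c(5, 10),
                   prevalence = ~ treatment, data = out$meta)
plot(kresult)  # one diagnostic per panel
```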
library(stm)
library(tidyverse)
library(ggthemes)
library(xtable)
library(viridis)
rm(list = ls())
load("../output/models/finalmodel_50.RDa")
model_df <- model_df %>%
  dplyr::mutate(
    doc_index = as.numeric(rownames(.)),
    # map raw source identifiers to display names
    source = dplyr::recode(source,
                           "welt.de"    = "DIE WELT",
                           "zeit.de"    = "ZEIT ONLINE",
                           "focus.de"   = "FOCUS Online",
                           "bild.de"    = "Bild.de",
                           "spiegel.de" = "SPIEGEL ONLINE",
                           "union"      = "Union",
                           "spd"        = "SPD",
                           "afd"        = "AfD",
                           "gruene"     = "Grüne",
                           "linke"      = "Linke",
                           "fdp"        = "FDP")
  )
To explore the words associated with each topic, we use the words with the highest probability in each topic. As we included the source type (press release or newspaper) as a covariate for the topical content (the word distribution of each topic), we have two different labels for each topic.
sagelabs <- sageLabels(stmOut)
theta <- as.data.frame(stmOut$theta) %>% # get all theta values for each document
mutate(doc_index = as.numeric(rownames(.))) %>%
# convert to long format
gather(topic, theta, -doc_index) %>%
mutate(topic = as.numeric(gsub("V","",topic))) %>%
# join with topic df
left_join(., topics.df, by="topic") %>%
# join with model_df
left_join(., model_df %>%
select(date,type,source,doc_index,title_text), by="doc_index")
For each document, we have a distribution over all topics, e.g.:
sample_doc <- sample(nrow(model_df),1)
# uncomment this to only select docs from press releases
#sample_doc <- theta %>% filter(type=="press") %>% sample_n(1) %>% select(doc_index)
#sample_doc <- sample_doc$doc_index
title <- model_df$title[which(model_df$doc_index == sample_doc)]
source <- model_df$source[which(model_df$doc_index == sample_doc)]
theta %>%
filter(doc_index == sample_doc) %>%
ggplot(aes(reorder(joint_label, desc(topic)), theta)) +
geom_col(fill="#3d648a", alpha = 0.8) +
ylim(c(0,1)) +
coord_flip() +
theme_hc() +
labs(title = paste("Topic distribution of document",sample_doc),
subtitle = paste0("Source: ",source,"\nTitle: ", title),
x = NULL, y = NULL
) +
theme(axis.text = element_text(size = 10))
ggsave("../figs/doc_topic_distr.png", height = 10, width = 10)
What is the document actually about?
model_df %>%
filter(doc_index == sample_doc) %>%
select(source, title_text) %>%
htmlTable::htmlTable(align="l", rnames=FALSE, header = c("Source", "Title + Body"))
| Source | Title + Body |
|---|---|
| tagesschau.de | German government confirms: two Germans detained in Iraq | tagesschau.de Linda W., a 16-year-old IS supporter from Saxony, is awaiting interrogation in Iraqi custody. In an exclusive interview with NDR, WDR and SZ, she talks about her time with the “Islamic State”. | more Der “Spiegel” had reported that four German women, believed to have joined IS in recent years, were being held in Baghdad. They were reportedly captured in the days after the liberation of Mosul. Iraqi security forces had previously stated that they had arrested 20 foreign jihadist women in an operation in Mosul. Whether the detainees could be extradited remains open. According to the Justice Ministry, talks are being held with the Iraqi authorities “about possibilities for cooperation”. There is, however, no extradition treaty with Iraq. Federal Public Prosecutor investigates: In Germany, too, Linda W. is again under investigation. The Federal Public Prosecutor's office has taken over the proceedings from the Dresden public prosecutor, a spokesman announced. There is, however, no arrest warrant against the 16-year-old yet. Investigations are also under way against three further women arrested by Iraqi security forces, who are believed to be German nationals. The initial suspicion is membership of a foreign terrorist organization, for which there are corresponding indications, the spokesman said. The Dresden prosecutors had closed the investigation against Linda W. only a few weeks earlier, as the whereabouts of the teenager from Pulsnitz were unknown until then. Deutschlandfunk reported on this topic on 24 July 2017 at 14:00. More on this topic: |
The expected proportion of the corpus that belongs to each topic is used to get an initial overview of the results. The figure below displays the topics ordered by their expected frequency across the corpus. The four most frequent words in each topic are used as a label for that topic.
overall_freq <- as.data.frame(colMeans(stmOut$theta)) %>%
transmute(
topic = as.numeric(rownames(.)),
frequency = colMeans(stmOut$theta)
) %>%
left_join(., topics.df, by = "topic") %>%
arrange(desc(frequency))%>%
mutate(order = row_number())
overall_freq %>%
ggplot(aes(reorder(joint_label, -order),
frequency, fill=frequency)) +
geom_col(show.legend = F) +
coord_flip() +
scale_fill_gradient(low = "#dee24e", high = "#421c64") +
theme_hc() +
labs(x=NULL, y=NULL)
ggsave("../figs/topic_proportion.png", height = 10, width = 10)
Agendas were measured as percentage distributions across the 50 topics. For each source, the average proportion of each topic is calculated per month. The following figures show the overall topic distribution.
topicmean_news <- theta %>%
filter(type == "news") %>%
group_by(topic,joint_label, source) %>%
summarise(topicmean = mean(theta)) %>%
ungroup()
topicmean_press <- theta %>%
filter(type == "press") %>%
group_by(topic,joint_label, source) %>%
summarise(topicmean = mean(theta)) %>%
ungroup()
topicmean_news %>%
group_by(source) %>%
arrange(desc(topicmean), .by_group = TRUE) %>%
mutate(topic_order = row_number()) %>%
ungroup() %>%
group_by(joint_label) %>%
mutate(topicmean_mean = mean(topicmean)) %>%
ungroup() %>%
top_n(70, topicmean_mean) %>%
ggplot(aes(reorder(joint_label, topicmean_mean),
topicmean, label = topic_order,
fill = topic_order)) +
geom_col(show.legend = F) +
geom_text(hjust=-0.1, size=5) +
coord_flip() +
scale_fill_gradient(low = "#00AFBB", high = "#FC4E07") +
facet_wrap(~source, nrow = 1) +
labs(x=NULL, y=NULL) +
theme(axis.text.y = element_text(size=12))
topicmean_press %>%
group_by(source) %>%
arrange(desc(topicmean), .by_group = TRUE) %>%
mutate(topic_order = row_number()) %>%
ungroup() %>%
group_by(joint_label) %>%
mutate(topicmean_mean = mean(topicmean)) %>%
ungroup() %>%
top_n(50, topicmean_mean) %>%
ggplot(aes(reorder(joint_label, topicmean_mean),
topicmean, label = topic_order,
fill=topic_order)) +
geom_col(show.legend = F) +
geom_text(hjust=-0.1, size=5) +
coord_flip() +
scale_fill_gradient(low = "#dee24e", high = "#421c64") +
facet_wrap(~source, nrow = 1) +
labs(x=NULL, y=NULL) +
theme(axis.text.y = element_text(size=12))
Then, we estimated bivariate correlations between party agendas and the mediated party agendas in the online news. These correlations represent the agenda selectivity each party experiences in each media outlet. The higher the correlation, the more congruent both agendas are.
# calculate topic mean by source and month
topicmean <- theta %>%
mutate(
year = lubridate::year(date),
month = lubridate::month(date)
) %>%
group_by(topic,source, month, year) %>%
dplyr::summarise(topicmean = mean(theta)) %>%
ungroup() %>%
spread(source, topicmean) %>%
filter(month != 3)
media <- unique(model_df %>% filter(type == "news") %>% select(source))
parties <- unique(model_df %>% filter(type == "press") %>% select(source))
if (exists("corrDF")) rm(corrDF)
for (i in parties$source) {
tempdf <- topicmean %>%
group_by(month, year) %>%
do(data.frame(Cor=t(cor(.[,media$source], .[,i])))) %>%
gather(medium, cor, 3:9) %>%
mutate(party = i,
medium = gsub("Cor.","",medium)) %>%
ungroup()
if (exists("corrDF")){
corrDF <- rbind(corrDF,tempdf)
} else {
corrDF <- tempdf
}
}
agenda <- corrDF %>%
mutate(date = as.Date(paste0(year,"/",month,"/1"))) %>%
  dplyr::mutate(medium = dplyr::recode(medium,
                                       "DIE.WELT"       = "DIE WELT",
                                       "ZEIT.ONLINE"    = "ZEIT ONLINE",
                                       "FOCUS.Online"   = "FOCUS Online",
                                       "SPIEGEL.ONLINE" = "SPIEGEL ONLINE")
  ) %>%
filter(month > 5)
normalize_data <- function(x) {
# normalize data between -1,1
if (is.numeric(x)) {
y <- 2*((x - min(x, na.rm = T)) / (max(x, na.rm = T) - min(x, na.rm = T)))-1
return(y)
} else {
return(x)
}
}
p <- agenda %>%
mutate(
date =as.Date(paste("01",month,year, sep = "-"), format="%d-%m-%Y")
) %>%
ggplot(aes(date, cor, color = medium)) +
geom_line(show.legend = F) +
geom_hline(yintercept = 0, size = 0.3, color = "grey30", linetype = 2) +
facet_wrap(~party) +
scale_color_viridis_d(name = NULL) +
labs(y=NULL, x =NULL)
# guides(colour = guide_legend(nrow = 1)) +
# theme(legend.position = "bottom",
# legend.title = element_blank())
plotly::ggplotly(p, tooltip=c("cor","medium"))
p <- agenda %>%
group_by(party, medium) %>%
summarize(cor = mean(cor, na.rm = T)) %>%
spread(key = party, value = cor) %>%
ggiraphExtra::ggRadar(aes(color = medium),
interactive = T,
alpha = 0,
rescale = F,
legend.position = "bottom")
htmlwidgets::saveWidget(p, "../figs/radarchart.html")
p
p <- agenda %>%
group_by(party, medium) %>%
summarize(cor = mean(cor, na.rm = T)) %>%
spread(key = party, value = cor) %>%
ggiraphExtra::ggRadar(aes(color = medium),
interactive = F,
alpha = 0,
rescale = F,
legend.position = "bottom")
ggsave(filename = "../figs/radarchart.png", p, width =9, height = 7)
Not all potential topics receive equal amounts of media coverage, owing to differences in political relevance. However, these factors should influence all media outlets equally. To what extent does a party's topic correlation in one medium differ from its average topic correlation across media? To calculate this relative topic correlation, I estimate the deviation of a party's topic correlation in one medium from that party's average topic correlation across all news outlets.
agenda_diff <- agenda %>%
group_by(party, date) %>%
# estimate average correlation for each party
mutate(cor_by_party = mean(cor, na.rm = T)) %>%
ungroup() %>%
# deviation of each party-medium correlation from the party average
mutate(
cor_diff = cor - cor_by_party
)
p <- agenda_diff %>%
ggplot(aes(date, cor_diff, color = medium)) +
geom_line(show.legend = F) +
geom_hline(yintercept = 0, size = 0.3, color = "grey30", linetype = 2) +
facet_wrap(~party) +
scale_color_viridis_d(name = NULL) +
labs(y=NULL, x =NULL)
plotly::ggplotly(p, tooltip=c("cor_diff","medium"))
p <- agenda_diff %>%
group_by(party, medium) %>%
summarize(cor = mean(cor_diff, na.rm = T)) %>%
spread(key = party, value = cor) %>%
ggiraphExtra::ggRadar(aes(color = medium),
interactive = T,
alpha = 0,
rescale = F,
legend.position = "bottom")
p
p <- agenda_diff %>%
filter(!medium == "tagesschau.de") %>%
group_by(party, medium) %>%
summarize(cor = mean(cor_diff, na.rm = T)) %>%
spread(key = party, value = cor) %>%
ggiraphExtra::ggRadar(aes(color = medium),
interactive = T,
alpha = 0,
rescale = F,
legend.position = "bottom")
p